{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Univariate analysis\n",
" \n",
"\n",
"## Objective\n",
"\n",
"As noted earlier, univariate analysis examines each variable (each column of a DataFrame) on its own to understand its characteristics and nature. It relies mostly on plots, although summary statistics such as the mean, the median, and the kurtosis, among many others, can also be computed.\n",
"\n",
"## Plotting example\n",
"\n",
"- [Visualization example](#8)\n",
"\n",
"## Analysis techniques\n",
"\n",
"1. [DataFrame.describe()](#3)\n",
"2. [Measures of central tendency](#1)\n",
"3. [Measures of dispersion](#2)\n",
"4. [Distribution plots](#4)\n",
"5. [Comparison plots](#5)\n",
"6. [Composition plots](#6)\n",
"7. [Using groupby](#7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Import libraries and load data\n",
"__[Data source](https://archive.ics.uci.edu/ml/datasets/Forest+Fires)__"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" X Y month day FFMC DMC DC ISI temp RH wind rain area\n",
"0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0\n",
"1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0\n",
"2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0\n",
"3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0\n",
"4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Load the data from the web\n",
"data = pd.read_csv('http://www.dsi.uminho.pt/~pcortez/forestfires/forestfires.csv')\n",
"\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame.describe()\n",
"\n",
"\n",
"Calling the _pandas_ method `describe()` returns a table with several summary statistics for each column. With `include='all'`, as in the cell below, non-numeric columns are summarized too (`unique`, `top`, `freq`). These statistics give a general overview of the data."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" X Y month day FFMC DMC DC \\\n",
"count 517.000000 517.000000 517 517 517.000000 517.000000 517.000000 \n",
"unique NaN NaN 12 7 NaN NaN NaN \n",
"top NaN NaN aug sun NaN NaN NaN \n",
"freq NaN NaN 184 95 NaN NaN NaN \n",
"mean 4.669246 4.299807 NaN NaN 90.644681 110.872340 547.940039 \n",
"std 2.313778 1.229900 NaN NaN 5.520111 64.046482 248.066192 \n",
"min 1.000000 2.000000 NaN NaN 18.700000 1.100000 7.900000 \n",
"25% 3.000000 4.000000 NaN NaN 90.200000 68.600000 437.700000 \n",
"50% 4.000000 4.000000 NaN NaN 91.600000 108.300000 664.200000 \n",
"75% 7.000000 5.000000 NaN NaN 92.900000 142.400000 713.900000 \n",
"max 9.000000 9.000000 NaN NaN 96.200000 291.300000 860.600000 \n",
"\n",
" ISI temp RH wind rain \\\n",
"count 517.000000 517.000000 517.000000 517.000000 517.000000 \n",
"unique NaN NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN NaN \n",
"mean 9.021663 18.889168 44.288201 4.017602 0.021663 \n",
"std 4.559477 5.806625 16.317469 1.791653 0.295959 \n",
"min 0.000000 2.200000 15.000000 0.400000 0.000000 \n",
"25% 6.500000 15.500000 33.000000 2.700000 0.000000 \n",
"50% 8.400000 19.300000 42.000000 4.000000 0.000000 \n",
"75% 10.800000 22.800000 53.000000 4.900000 0.000000 \n",
"max 56.100000 33.300000 100.000000 9.400000 6.400000 \n",
"\n",
" area month_number \n",
"count 517.000000 517.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 12.847292 7.475822 \n",
"std 63.655818 2.275990 \n",
"min 0.000000 1.000000 \n",
"25% 0.000000 7.000000 \n",
"50% 0.520000 8.000000 \n",
"75% 6.570000 9.000000 \n",
"max 1090.840000 12.000000 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_description = data.describe(include='all')\n",
"data_description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualization\n",
"\n",
"\n",
"This is only an example of how to use the plotting libraries. The cell below plots the distribution of the *temperature* values and adds vertical lines that mark the measures of central tendency and the limits beyond which values are treated as outliers.\n",
"\n",
"Two libraries are used to plot the data:\n",
"\n",
"- [Seaborn](https://seaborn.pydata.org/generated/seaborn.kdeplot.html)\n",
"\n",
"- [matplotlib](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.axvline.html?highlight=axvline)\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"sns.set(color_codes=True)\n",
"# `fill=True` replaces the deprecated `shade=True` (seaborn >= 0.11)\n",
"sns.kdeplot(data['temp'], fill=True)\n",
"\n",
"# Add vertical lines at the measures of central tendency\n",
"plt.axvline(data['temp'].mean(), color='g') # Green line: the mean\n",
"plt.axvline(data['temp'].median(), color='black') # Black line: the median\n",
"plt.axvline(data_description['temp']['25%'], color='black') # Black line: Q1\n",
"plt.axvline(data_description['temp']['75%'], color='black') # Black line: Q3\n",
"\n",
"IQR = data_description['temp']['75%'] - data_description['temp']['25%']\n",
"\n",
"upper_outliers = data_description['temp']['75%'] + 1.5*IQR\n",
"lower_outliers = data_description['temp']['25%'] - 1.5*IQR\n",
"\n",
"# Add two red lines marking the limits beyond which values are outliers\n",
"plt.axvline(upper_outliers, color='r') \n",
"plt.axvline(lower_outliers, color='r')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Measures of central tendency\n",
"\n",
"\n",
"The measures used to describe central tendency are:\n",
"\n",
"- **Mean**"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"18.88916827852998, 18.88916827852998, 18.88916827852998\n"
]
}
],
"source": [
"from statistics import mean\n",
"\n",
"a1 = data['temp'].mean() # using pandas\n",
"a2 = mean(data['temp']) # using the statistics module\n",
"a3 = np.mean(data['temp']) # using numpy\n",
"\n",
"print(f'{a1}, {a2}, {a3}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Median**"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"19.3, 19.3, 19.3\n"
]
}
],
"source": [
"from statistics import median\n",
"\n",
"a1 = data['temp'].median() # pandas\n",
"a2 = median(data['temp']) # statistics\n",
"a3 = np.median(data['temp']) # numpy\n",
"\n",
"print(f'{a1}, {a2}, {a3}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Quartiles**\n",
"\n",
"These are the values that divide the sorted data into four equal parts.\n",
"\n",
" - 1st quartile (Q1): 25% of the data is less than or equal to this value.\n",
" - 2nd quartile (Q2): the median. 50% of the data is less than or equal to this value.\n",
" - 3rd quartile (Q3): 75% of the data is less than or equal to this value.\n",
"\n",
""
]
},
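{
"cell_type": "markdown",
"metadata": {},
"source": [
"The quartiles described above can be computed directly. The cell below is a minimal sketch, assuming `data` is the DataFrame loaded earlier and `np` is numpy as imported there; the names `quartiles_pd` and `quartiles_np` are hypothetical and introduced only for illustration. Both calls use linear interpolation by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Q1, Q2 and Q3 of 'temp' (variable names are hypothetical, for illustration only)\n",
"quartiles_pd = data['temp'].quantile([0.25, 0.5, 0.75]) # pandas\n",
"quartiles_np = np.percentile(data['temp'], [25, 50, 75]) # numpy\n",
"\n",
"print(quartiles_pd.tolist(), quartiles_np.tolist())"
]
},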
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Measures of dispersion\n",
"\n",
"\n",
"These measures show how spread out the data are.\n",
"\n",
"- **Variance and standard deviation**\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Standard deviation: 5.806625349573505, 5.806625349573505, 5.801006939598366\n",
"Variance: 33.71689795030963, 33.71689795030963, 33.65168151326841\n"
]
}
],
"source": [
"from statistics import stdev\n",
"from statistics import variance\n",
"\n",
"# pandas and statistics compute the sample estimate (n - 1 in the denominator);\n",
"# numpy defaults to the population estimate (ddof=0), hence its slightly smaller value.\n",
"std1 = data['temp'].std() # pandas\n",
"std2 = stdev(data['temp']) # statistics\n",
"std3 = np.std(data['temp']) # numpy; pass ddof=1 to match the other two\n",
"\n",
"varianza1 = data['temp'].var() # pandas\n",
"varianza2 = variance(data['temp']) # statistics\n",
"varianza3 = np.var(data['temp']) # numpy; pass ddof=1 to match the other two\n",
"\n",
"print(f'Standard deviation: {std1}, {std2}, {std3}')\n",
"print(f'Variance: {varianza1}, {varianza2}, {varianza3}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Outliers**\n",
"\n",
""
]
},
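{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to flag outliers is the 1.5 × IQR rule already used in the visualization example above. The cell below is a minimal sketch, assuming `data` is the DataFrame loaded earlier; `q1`, `q3`, `iqr`, `lower_limit`, `upper_limit`, and `outliers` are hypothetical names chosen for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Interquartile range of 'temp' (variable names are hypothetical, for illustration only)\n",
"q1 = data['temp'].quantile(0.25)\n",
"q3 = data['temp'].quantile(0.75)\n",
"iqr = q3 - q1\n",
"\n",
"# Limits beyond which a value is treated as an outlier\n",
"lower_limit = q1 - 1.5 * iqr\n",
"upper_limit = q3 + 1.5 * iqr\n",
"\n",
"# Rows whose temperature falls outside the limits\n",
"outliers = data[(data['temp'] < lower_limit) | (data['temp'] > upper_limit)]\n",
"\n",
"print(f'Limits: [{lower_limit}, {upper_limit}]; outliers found: {len(outliers)}')"
]
},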
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Skewness. The degree of asymmetry of the distribution.**\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-0.331172237347285 -0.3302106140354586\n"
]
}
],
"source": [
"from scipy.stats import skew\n",
"\n",
"a1 = data['temp'].skew() # pandas (bias-corrected estimate)\n",
"a2 = skew(data['temp']) # scipy (biased by default; pass bias=False to match pandas)\n",
"\n",
"print(a1, a2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Kurtosis. A measure of how heavy the tails of a distribution are.**\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.1361655076587991 0.12326917606611909\n"
]
}
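],
"source": [
"from scipy.stats import kurtosis\n",
"\n",
"a1 = data['temp'].kurt() # pandas (bias-corrected excess kurtosis)\n",
"a2 = kurtosis(data['temp']) # scipy (excess kurtosis, biased by default; pass bias=False to match pandas)\n",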
"\n",
"print(a1, a2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distribution plots\n",
"\n",
"\n",
"Instead of calling external plotting libraries directly, as in the example above, plots can also be drawn through *pandas*. __[pandas plots](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)__ \n",
"\n",
"- **[Histogram](https://es.wikipedia.org/wiki/Histograma)**\n",
"\n",
"Shows the frequencies of different categories or ranges of values (classes or bins).\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"